
[Feat] EASI-ER CLI #31

Open
oscarqjh wants to merge 242 commits into EvolvingLMMs-Lab:main from oscarqjh:dev

Conversation


oscarqjh (Collaborator) commented Mar 9, 2026

Summary

This PR adds the complete easi-er CLI — a unified evaluation framework for embodied AI agents. It introduces subprocess-isolated simulators, multi-split task definitions, LLM-powered agents, and a CLI for running evaluations across multiple benchmarks.

Core Framework

  • Subprocess isolation: Each simulator runs in its own conda environment (potentially different Python version), communicating via filesystem IPC
  • Multi-split tasks: YAML-based task configs with template inheritance (extends), auto-discovered by registry
  • Pluggable agents: Dummy agent for testing, ReAct agent with multi-action buffering for real evaluation
  • AgentMemory + PromptBuilder: Shared state architecture with task-specific prompt builders following a standardized format
  • Parallel evaluation: Thread-pool parallelism with multi-instance vLLM support (--num-parallel, --vllm-instances)
  • Resume support: Interrupted runs can be resumed from logs/ output directory
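The `extends`-based template inheritance for multi-split task configs could look roughly like this. This is a minimal sketch: the field names (`extends`, `split`, `simulator`) and the merge semantics are assumptions for illustration, and plain dicts stand in for parsed YAML rather than pulling in a YAML parser.

```python
# Sketch of template inheritance for multi-split task configs.
# Field names (extends, split, simulator) are illustrative assumptions.

def resolve_config(name, registry):
    """Resolve a task config, recursively merging its `extends` parent."""
    cfg = dict(registry[name])
    parent_name = cfg.pop("extends", None)
    if parent_name is None:
        return cfg
    merged = resolve_config(parent_name, registry)
    merged.update(cfg)  # child fields override parent fields
    return merged

# Parsed-YAML stand-ins: a base template and one split extending it.
registry = {
    "eb_alfred_base": {"simulator": "ai2thor:v2_1_0", "max_steps": 50},
    "eb_alfred_common_sense": {
        "extends": "eb_alfred_base",
        "split": "common_sense",
    },
}

cfg = resolve_config("eb_alfred_common_sense", registry)
# cfg inherits simulator/max_steps from the base and adds its own split
```

A registry that auto-discovers such files would then only need to resolve each split's config on demand.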

Simulators (7)

  • AI2-THOR v2.1.0 (Python 3.8) — EB-Alfred
  • AI2-THOR v5.0.0 (Python 3.10) — EB-Navigation, AI2-THOR Rearrangement
  • AI2-THOR v3.3.5 (Python 3.10) — ManipulaTHOR
  • Habitat-Sim v0.1.7 (Python 3.8) — VLN-CE R2R, VLN-CE RxR
  • Habitat-Sim v0.3.0 (Python 3.9) — EB-Habitat, LHPR-VLN
  • CoppeliaSim v4.1.0 (Python 3.10) — EB-Manipulation
  • TDW v1.11.23 (Python 3.10) — HAZARD

Benchmarks (10)

  • EmbodiedBench: EB-Alfred (6 splits), EB-Navigation (5 splits), EB-Habitat (4 splits), EB-Manipulation (4 splits)
  • HAZARD: Fire, Flood, Wind scenarios
  • ManipulaTHOR: Arm point navigation (seen/unseen)
  • AI2-THOR Rearrangement 2023: 5 evaluation splits
  • VLN-CE R2R: Vision-and-language navigation (val seen/unseen)
  • VLN-CE RxR: Multilingual VLN (val seen/unseen, en/hi/te)
  • LHPR-VLN: Multi-subtask navigation (val/test splits)

LLM Infrastructure

  • LLMClient: LiteLLM wrapper supporting any backend (OpenAI, Anthropic, Gemini, vLLM)
  • ServerManager / MultiServerManager: vLLM subprocess lifecycle with GPU allocation, tensor parallelism, and port auto-probing
  • GPU isolation: --vllm-gpus and --sim-gpus for separating LLM inference from simulator rendering
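Port auto-probing for locally launched vLLM servers can be sketched with the standard library. This is an illustrative approximation, not ServerManager's actual code; the starting port and attempt count are assumptions.

```python
import socket

def probe_free_port(start: int = 8000, attempts: int = 100) -> int:
    """Return the first TCP port at or above `start` that accepts a bind."""
    for port in range(start, start + attempts):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            try:
                sock.bind(("127.0.0.1", port))
            except OSError:
                continue  # port busy; try the next one
            return port
    raise RuntimeError(f"no free port in [{start}, {start + attempts})")

port = probe_free_port()
```

Probing before launch avoids the race where two parallel vLLM instances are handed the same default port.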

Standardized Prompt Format

  • EASI Standard Prompt Format Reference (docs/easi-prompt-format-reference.md)
  • All non-EmbodiedBench prompt builders aligned: standard section headers, 4-field JSON response format, consistent action history format
  • EmbodiedBench benchmarks retain their original published formats

CLI

easi task list / info / download / scaffold
easi env list / install / check
easi sim test
easi start --agent react --backend openai --model gpt-4o
easi start --resume ./logs/<run_id>

Testing

  • 824 tests, all passing
  • All tests run offline without simulators or LLMs (mocked subprocess bridges)

Implement the easi Python library for orchestrating simulator-based
embodied reasoning evaluation. Includes subprocess isolation via
filesystem IPC, versioned simulator management (conda+uv), agent
interface, task/benchmark framework, and CLI.

Components: core abstractions, dummy/AI2-THOR simulators, dummy task,
dummy agent, LLM client + dummy server, full test suite (44 tests).

Add embodied agent evaluation pipeline with real simulator support:
- ReAct agent with multi-action buffering and PromptBuilder protocol
- EB-Alfred task support (6 splits via multi-split YAML discovery)
- AI2-THOR v2.1.0 bridge with skill-based actions and state tracking
- EvaluationRunner with structured output: <output_dir>/<task>/<run_id>/
- Per-episode artifacts: result.json, trajectory.jsonl, rgb_*.png
- Centralized logging (print -> logger), --verbosity CLI option
- Subprocess observability: bridge output streaming, Ctrl+C cleanup
- LLM API client (OpenAI-compatible) and dummy LLM server
- 106 tests passing

Split the monolithic bridge into a generic AI2ThorBridge base class
(simulator layer) and an EBAlfredBridge subclass (task layer), so that
future benchmarks using ai2thor==2.1.0 can reuse the simulator bridge
without rewriting controller management, IPC, or navigation helpers.

- Extract EB-Alfred goal evaluation and task loading into
  easi/tasks/ebalfred/thor_utils.py
- Trim easi/simulators/ai2thor/v2_1_0/thor_utils.py to generic-only
  constants and object query utilities
- Refactor bridge.py from EBAlfredBridge (1062 lines) to generic
  AI2ThorBridge (~314 lines) with configurable simulator_kwargs
- Create easi/tasks/ebalfred/bridge.py with EBAlfredBridge subclass
  containing all skill execution, state tracking, and goal evaluation
- Add get_bridge_script_path() and simulator_kwargs to BaseTask and
  TaskProtocol; override in EBAlfredTask
- Update EvaluationRunner to prefer task-specific bridge paths and
  forward simulator_kwargs
- Add simulator_kwargs to all EB-Alfred YAML configs
- Add 29 tests covering imports, inheritance, method separation,
  bridge path resolution, and simulator_kwargs
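The base/subclass split described above can be sketched as follows. Class and method names mirror the commit message, but the bodies are illustrative stubs, not the real bridge code.

```python
class AI2ThorBridge:
    """Generic simulator-layer bridge: controller lifecycle + IPC only."""

    def __init__(self, simulator_kwargs=None):
        self.simulator_kwargs = simulator_kwargs or {}
        self.controller = None

    def launch(self):
        # The real bridge would start an ai2thor Controller here.
        self.controller = object()

    def step(self, action):
        raise NotImplementedError("task-layer subclasses define actions")


class EBAlfredBridge(AI2ThorBridge):
    """Task-layer bridge: skill execution, state tracking, goal evaluation."""

    def step(self, action):
        # The real subclass would map skills onto controller actions and
        # evaluate EB-Alfred goal conditions; stubbed for illustration.
        return {"action": action, "success": True}


bridge = EBAlfredBridge(simulator_kwargs={"max_steps": 50})
bridge.launch()
result = bridge.step("pick_up(apple)")
```

A future ai2thor==2.1.0 benchmark would subclass `AI2ThorBridge` the same way, reusing controller management and IPC for free.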
Unified client with generate() and generate_structured() methods,
lazy imports, and cumulative usage tracking (tokens + cost).

Manages start/stop, port checking, health polling with timeout,
and context manager support. Extensible for future backends.
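The start/stop lifecycle with health polling and context-manager support can be sketched like this. The class name matches the commit message, but the constructor signature and polling interval are assumptions; the real manager spawns a vLLM subprocess where this sketch only polls an injected health check.

```python
import time

class ServerManager:
    """Sketch: start with health polling + timeout, stop, context manager."""

    def __init__(self, health_check, timeout=30.0, interval=0.01):
        self._health_check = health_check  # callable returning bool
        self.timeout = timeout
        self.interval = interval
        self.running = False

    def start(self):
        # The real manager would spawn the vLLM subprocess here, then poll.
        deadline = time.monotonic() + self.timeout
        while time.monotonic() < deadline:
            if self._health_check():
                self.running = True
                return
            time.sleep(self.interval)
        raise TimeoutError("server failed health check before timeout")

    def stop(self):
        self.running = False

    def __enter__(self):
        self.start()
        return self

    def __exit__(self, *exc):
        self.stop()
        return False

# Health check that succeeds on the third poll, simulating slow startup.
attempts = {"n": 0}
def flaky_health():
    attempts["n"] += 1
    return attempts["n"] >= 3

with ServerManager(flaky_health) as mgr:
    inside = mgr.running
```

The context manager guarantees `stop()` runs even if an evaluation raises mid-run.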
New arguments for `easi run` to select LLM backend and configure
inference server. Backward compatible with existing --llm-url.

Runner now resolves backend, auto-starts vLLM when needed,
creates LLMClient for non-legacy backends, wraps structured
output, and tracks LLM usage per-episode and per-run.

- Track usage in generate_structured() via instructor's _raw_response
- Fix log file handle leak in ServerManager (store and close in stop())
- Remove duplicate agent_config computation in runner._create_agent()
…port

Replace the tightly-coupled agent/prompt design with a memory-based
architecture where AgentMemory holds shared state, PromptBuilder reads
from memory to construct prompts and parse responses, and the agent is
a thin orchestrator.

Key changes:
- AgentMemory + StepRecord dataclasses as shared agent state
- New PromptBuilderProtocol: build_messages(memory) + parse_response(response, memory)
- Simplified BaseAgent (removed _chat_history, abstract stubs, default act())
- ReActAgent rewritten as thin orchestrator delegating to builder
- EBAlfredPromptBuilder gains chat_history=True mode with VLMPlanner parity
- json_repair moved to easi/utils/ (old location re-exports)
- Removed stateless flag from agent config (builder controls mode)
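The memory-based architecture above can be sketched as follows. The class and method names (`AgentMemory`, `StepRecord`, `build_messages`, `parse_response`) follow the commit message; the field set and the toy builder are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Optional, Protocol

@dataclass
class StepRecord:
    """One agent step: what was observed and what the LLM answered."""
    observation: str
    llm_response: Optional[str] = None

@dataclass
class AgentMemory:
    """Shared state: the builder reads it, the agent writes it."""
    instruction: str = ""
    steps: list = field(default_factory=list)

class PromptBuilderProtocol(Protocol):
    def build_messages(self, memory: AgentMemory) -> list: ...
    def parse_response(self, response: str, memory: AgentMemory) -> Any: ...

class EchoPromptBuilder:
    """Toy builder: latest observation in, raw response recorded as action."""

    def build_messages(self, memory):
        latest = memory.steps[-1].observation if memory.steps else ""
        return [{"role": "user", "content": f"{memory.instruction}\n{latest}"}]

    def parse_response(self, response, memory):
        memory.steps[-1].llm_response = response
        return response.strip()

memory = AgentMemory(instruction="go to the kitchen")
memory.steps.append(StepRecord(observation="you see a hallway"))
builder = EchoPromptBuilder()
messages = builder.build_messages(memory)
action = builder.parse_response(" move_forward ", memory)
```

With this split, the agent stays a thin orchestrator: it only shuttles messages to the LLM and hands responses back to the builder.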
Builder-owned schema enforcement: prompt builders can now optionally
implement get_response_format() to provide a JSON schema dict that gets
passed through to litellm. ReActAgent handles fallback automatically
when the backend doesn't support response_format.

- LLMClient.generate() accepts optional response_format param
- ReActAgent._generate_with_fallback() tries schema, caches on failure
- EBAlfredPromptBuilder.get_response_format() returns vlm_generation_guide
- Remove dead code: instructor dep, Pydantic schemas, monkey-patching
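The schema-with-fallback behaviour can be sketched in isolation. This is a standalone illustration of the try-once-then-cache pattern, not the actual `ReActAgent._generate_with_fallback()` code; the exception type and fake client are assumptions.

```python
class SchemaNotSupported(Exception):
    """Stand-in for a backend rejecting response_format."""

class FallbackAgent:
    """Sketch: try structured output once; on failure, cache and go plain."""

    def __init__(self, client, response_format=None):
        self._client = client
        self._response_format = response_format
        self._schema_unsupported = False  # cached after the first failure

    def generate(self, messages):
        if self._response_format and not self._schema_unsupported:
            try:
                return self._client(messages, response_format=self._response_format)
            except SchemaNotSupported:
                self._schema_unsupported = True  # never retry the schema
        return self._client(messages, response_format=None)

# Fake backend that rejects response_format, like a server without
# structured-output support.
calls = []
def fake_client(messages, response_format=None):
    calls.append(response_format)
    if response_format is not None:
        raise SchemaNotSupported()
    return '{"action": "move_forward"}'

agent = FallbackAgent(fake_client, response_format={"type": "json_object"})
first = agent.generate([{"role": "user", "content": "hi"}])
second = agent.generate([{"role": "user", "content": "again"}])
```

Caching the failure means the schema is attempted exactly once per agent, so later steps pay no extra round trip.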
Add BaseTask.on_episode_reset() hook for task-specific post-reset setup.
EBAlfredTask overrides it to update agent action space from bridge metadata,
removing EB-Alfred-specific logic from the general EvaluationRunner.
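The hook pattern can be sketched like this. The hook name matches the commit message; the surrounding method names and metadata shape are illustrative assumptions.

```python
class BaseTask:
    """Sketch of the post-reset hook; real signatures may differ."""

    def reset_episode(self, agent, bridge_metadata):
        # ...generic per-episode reset work would happen here...
        self.on_episode_reset(agent, bridge_metadata)

    def on_episode_reset(self, agent, bridge_metadata):
        """Default: no task-specific setup."""

class EBAlfredTask(BaseTask):
    def on_episode_reset(self, agent, bridge_metadata):
        # Push the episode's action space from bridge metadata to the
        # agent, keeping EB-Alfred specifics out of the generic runner.
        agent.action_space = bridge_metadata.get("action_space", [])

class DummyAgent:
    action_space = None

agent = DummyAgent()
EBAlfredTask().reset_episode(agent, {"action_space": ["pick_up", "put_down"]})
```

The runner only ever calls the generic reset path; tasks that need nothing extra inherit the no-op default.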
…onfig

- trajectory.jsonl: add llm_response field to each step entry
- result.json: add instruction field for each episode
- config.json: include all CLI options and full task YAML config
Retry: LLMClient passes num_retries to litellm.completion() for
automatic exponential backoff on transient errors (timeouts, rate
limits). Configurable via --max-retries (default 3).

Resume: --resume <run_dir> loads config.json from a previous run,
skips completed episodes, clears and re-runs the last episode (which
may have been interrupted), then continues the remaining episodes.
All CLI options are restored from config.json so only --resume is needed.
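The episode-selection half of resume can be sketched as a pure function. This is illustrative, not the runner's actual code, and it omits the clearing of the re-run episode's output directory.

```python
def plan_resume(all_episodes, completed):
    """Pick episodes to (re)run on resume: re-run the last completed
    episode (it may have been interrupted mid-write), then continue
    with everything not yet completed."""
    if not completed:
        return list(all_episodes)
    rerun = completed[-1]
    remaining = [ep for ep in all_episodes if ep not in completed]
    return [rerun] + remaining

episodes = ["ep0", "ep1", "ep2", "ep3", "ep4"]
todo = plan_resume(episodes, completed=["ep0", "ep1", "ep2"])
# re-runs ep2, then continues with ep3 and ep4
```

Because the CLI options are restored from config.json, this plan is all the runner needs to pick up where it left off.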
…ps bug

Fix max_steps mismatch where YAML configured 50 but vendor EBAlfEnv
hardcoded 30. Now max_steps flows from YAML through simulator_kwargs
to the bridge and vendor env.

Add per-episode retry in EvaluationRunner: on crash (e.g. AI2-THOR
Unity segfault), the episode dir is cleared, the simulator is
re-launched, and the episode is retried up to max_retries times.
If all retries are exhausted the episode is recorded as failed and
the runner continues to the next episode.
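The crash-retry loop can be sketched as follows. The function and callback names are hypothetical; `RuntimeError` stands in for whatever exception a crashed simulator bridge surfaces.

```python
def run_episode_with_retry(run_fn, clear_fn, relaunch_fn, max_retries=3):
    """Sketch of per-episode retry: on a crash, clear the episode dir,
    relaunch the simulator, and retry; record failure once retries are
    exhausted so the runner can move on."""
    for _attempt in range(max_retries):
        try:
            return run_fn()
        except RuntimeError:  # stand-in for a simulator crash
            clear_fn()
            relaunch_fn()
    return {"status": "failed", "retries": max_retries}

# Simulated episode that segfaults twice, then succeeds.
state = {"crashes_left": 2, "relaunches": 0}
def run_fn():
    if state["crashes_left"] > 0:
        state["crashes_left"] -= 1
        raise RuntimeError("Unity segfault")
    return {"status": "success"}

result = run_episode_with_retry(
    run_fn,
    clear_fn=lambda: None,
    relaunch_fn=lambda: state.__setitem__("relaunches", state["relaunches"] + 1),
)
```

Recording the failure instead of raising keeps one flaky episode from sinking an overnight run.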
Integrate EmbodiedBench EB-Navigation into EASI with vendored env,
task bridge, prompt builder, and 5 split configs (ai2thor v5.0.0).

Remove action_space field from all YAML configs and TaskEntry. Tasks
now define their action space via _build_action_space() override with
caching, eliminating the confusing pattern of empty YAML fields.
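The override-with-caching pattern can be sketched like this. The method name `_build_action_space` follows the commit message; the property-based cache and the example task are illustrative assumptions.

```python
class BaseTask:
    """Sketch: tasks define their action space in code, built once."""
    _action_space = None

    def _build_action_space(self):
        raise NotImplementedError("each task defines its own actions")

    @property
    def action_space(self):
        # Build lazily on first access, then reuse the cached list.
        if self._action_space is None:
            self._action_space = self._build_action_space()
        return self._action_space

class NavTask(BaseTask):
    build_calls = 0  # counter to demonstrate single construction

    def _build_action_space(self):
        NavTask.build_calls += 1
        return ["move_forward", "turn_left", "turn_right", "stop"]

task = NavTask()
first = task.action_space
second = task.action_space  # cached; _build_action_space not called again
```

Moving the definition into code removes the empty `action_space:` fields that every YAML config previously had to carry.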
…atform

Replace stub bridge with working AI2ThorV5Bridge class that starts a
real controller, handles scene reset and discrete navigation actions.
Switch platform from CloudRendering to Linux64. Increase sim test
timeout default from 30s to 200s for THOR startup.

Fix OUTPUT_TEMPLATE trailing spaces on 3 lines and regenerate
navigation_examples.json from source to fix line continuation artifact
and curly quote mismatch. Verified character-level parity.

- Add habitat_sim:v0_3_0 simulator registration (conda env + manifest)
- Vendor EBHabEnv from EmbodiedBench with fixed imports
- Add EBHabitatTask with dynamic action space via on_episode_reset hook
- Add EBHabitatPromptBuilder matching VLMPlanner prompt construction
- Add 6 per-split YAML configs (base, common_sense, complex_instruction,
  spatial_relationship, visual_appearance, long_horizon)
- Add 26 offline tests for actions, task, prompts, and registry
- Move EB-Habitat-specific deps (gym, hydra-core, omegaconf, imageio,
  habitat-lab) from simulator requirements.txt to task YAML additional_deps

Add --redownload to 'easi task download' and 'easi run' to force
re-download of cached HuggingFace datasets. Useful when a previous
download was interrupted or incomplete.

oscarqjh and others added 30 commits March 13, 2026 17:51
…ridge

Add trajectory visualization hooks to the EB-Habitat bridge using
habitat-sim 0.3.0 API (articulated_agent.base_pos). Includes topdown
map rendering via pathfinder, start position persistence, and per-step
agent position tracking in trajectory info.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…prompt fix

- react_agent: force stop after N consecutive fallbacks (configurable via
  max_consecutive_fallbacks in YAML, default 0 = disabled)
- react_agent: reprompt correction message now comes from prompt builder
  (fixes SFT builder getting JSON-format correction prompt)
- metrics: add early_stop_rate to generic_aggregate
- runner: record forced_early_stop in per-episode results
- llm/client: support skip_special_tokens via extra_body for vLLM backend
- llm/models/internvl3: bypass model.chat() when skip_special_tokens=False
  to preserve SFT action tokens that get stripped by hardcoded decode
- llm/models/qwen3_vl: read skip_special_tokens from kwargs
- sft prompt builder: add get_reprompt_message() for action token format
- _base.yaml: add max_consecutive_fallbacks: 3

- prompt_builder: validate PNG with PIL before base64 encoding, retry
  with exponential backoff (0.1s, 0.2s, 0.4s) if truncated, send
  anyway after retries to trigger episode restart
- sft prompt builder: switch to interleaved content blocks (image_url
  at correct positions instead of all-first with <image> in text),
  matching InternVL chat template convention
- test: use real PIL image instead of fake PNG header bytes
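The validate-then-retry pattern from the first bullet can be sketched without the PIL dependency. The commit validates with PIL; this sketch uses a signature/trailer check instead so it stays self-contained, and the function names and delays mirror but do not reproduce the real builder.

```python
import time

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def png_looks_complete(data: bytes) -> bool:
    """Heuristic completeness check: PNG signature and final IEND chunk."""
    return data.startswith(PNG_SIGNATURE) and data.endswith(b"IEND\xaeB`\x82")

def read_png_with_retry(read_fn, retries=3, base_delay=0.1):
    """Read bytes via read_fn; if they look truncated, retry with
    exponential backoff (base_delay, 2x, 4x). Return the last read
    regardless, so a bad frame still surfaces downstream and can
    trigger an episode restart."""
    data = read_fn()
    for attempt in range(retries):
        if png_looks_complete(data):
            return data
        time.sleep(base_delay * (2 ** attempt))
        data = read_fn()
    return data

# Simulate a frame that is truncated on the first read, complete after.
full = PNG_SIGNATURE + b"\x00\x00\x00\x00IEND\xaeB`\x82"
reads = [full[:8], full]
data = read_png_with_retry(lambda: reads.pop(0) if reads else full,
                           base_delay=0.001)
```

Sending the frame anyway after exhausted retries matches the commit's behaviour: a corrupt frame is worth more as a visible failure than a silent hang.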
ReActAgent captures the flattened prompt preview (`--- {role} ---`
blocks with `[img_N]` markers) alongside llm_response each time it
queries the LLM. Runner threads it into trajectory.jsonl under a new
"prompt" field next to "llm_response". Buffered / fallback-only
steps keep prompt = null.

When MIRROR_DEBUG_DIR is set, the mirror prompt builder writes every
flipped PNG it emits (current 3 slots + full history buffer) under that
directory with step+slot naming. Lets us A/B against sim-rendered
originals to verify the mirror transform and slot swap. Paired with
debug_mirror_one_episode.sh which runs one episode + stages the dumps
into the episode dir next to the originals.

SFT checkpoints emit action tokens <|forward|>/<|left|>/etc. vLLM's
OpenAI chat endpoint strips them at decode by default, which causes
_parse_sft_response to return empty and every step falls back to
move_forward. Set skip_special_tokens=false in _base.yaml
generation_kwargs so the tokens survive.

Adds LHPRVLNEnhancedSFTPromptBuilder that applies fantasy-vln's
display_env preprocessing (contrast 1.5 + resize 366) at prompt-build
time. Subclass of LHPRVLNSFTPromptBuilder; kwargs enhance_contrast and
resize_to are yaml-tunable. Ships new tasks for val_filtered and
test_filtered splits plus 9 unit tests covering contrast math, resize,
RGBA passthrough, and structural parity with the parent builder.
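The contrast half of the preprocessing can be sketched with plain arithmetic. This approximates PIL's `ImageEnhance.Contrast` (blend toward the mean) on a grayscale pixel list; the real builder operates on RGB images via PIL and also resizes to 366, which is omitted here.

```python
def enhance_contrast(pixels, factor=1.5):
    """Contrast enhancement around the mean, approximating PIL's
    ImageEnhance.Contrast on a flat grayscale pixel list: each pixel is
    pushed away from the mean by `factor`, then clamped to [0, 255]."""
    mean = sum(pixels) / len(pixels)
    return [max(0, min(255, round(mean + factor * (p - mean))))
            for p in pixels]

# factor=1.5 widens the spread around the mean (120 here).
out = enhance_contrast([100, 120, 140])
```

Making `factor` (and, in the real builder, the resize target) YAML-tunable keeps the preprocessing an evaluation-time knob rather than a code change.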
Adds save_rgba + enhance_images toggles to the LHPR-VLN bridge so the
enhancement (contrast 1.5 + resize 366) happens BEFORE the PNG hits
disk, matching fantasy-vln's display_env pipeline exactly. Paired with
the baseline SFT prompt builder (no in-prompt transform), the on-disk
image and the VLM input are both the 366x366 RGBA-enhanced frame.
Defaults off — existing yamls unaffected.

LHPRVLNBridge.reset() called self._scene_sim.actor('move_forward') to
fetch an initial observation. That call applied a second move_forward
(on top of the one already inside SceneSimulator.__init__), shifting
the agent's starting pose by 0.25 m along its facing direction.
Because actor() on step=-1 returns early without refreshing self.info,
the reset's logged pose stayed stale while the agent had actually
moved, so step 0 -> step 1 in trajectory.jsonl showed a phantom 0.25
m translation on a pure turn_left.

SceneSimulator.__init__ already captures the initial observation; use
it directly. Matches fantasy-vln's agent loop which starts with
task_sim.observations without calling actor() first.

Impact: EASI episodes now start at the same pose as fantasy for a
given scene/seed, enabling apples-to-apples SR comparison. Prior EASI
eval numbers were computed with the +0.25 m bias and are not directly
comparable.
Duplicates the bridge-enhanced val task under a short name so autoeval
treats it as a fresh task key and reruns checkpoints against the
post-bridge-fix code without colliding with prior bridge_enhanced
summaries. Temporary — safe to delete when done.

SceneSimulator.actor() had an early return on self.step == -1 that
absorbed the first call without incrementing bookkeeping or running
the max_step check. Combined with b6eb3c6 (bridge no longer calls
actor on reset), the runner's for step in range(max_steps) loop left
self.step one short of max_steps, so the timeout branch never fired
and nav_errors / nav_steps were never appended for episodes that ran
to the wall clock. aggregate_results then died with IndexError in
NavigationMetrics.TAR() because gt_path outlasted error_length.

This also silently masked a pre-existing bug: a first-action 'stop'
was dropped (skipped sim.step and took the early return), never
advancing the stage or recording a nav_error.

Fix: initialise self.step = 0 at end of __init__ and remove the
early-return block from actor(). Every call now goes through the full
step-increment + info-refresh + oracle/stop/max pipeline. Semantic
shift: nav_steps[0] is now +1 vs pre-fix runs because it includes the
init move_forward; fantasy-vln's nav_steps already carry the same +1
offset (their while-not-episode-over loop makes one extra actor call
past max_steps), so post-fix EASI is more comparable to fantasy.

Adds tests/test_lhpr_vln_scene_step.py covering timeout length
invariants, all-subtasks-success accounting, first-action-stop being
honoured, and mixed partial success + timeout — all via a
habitat_sim stub so they run offline in the main venv.